{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ur8xi4C7S06n"
   },
   "outputs": [],
   "source": [
    "# Copyright 2021 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tvgnzT1CKxrO"
   },
   "source": [
    "# Overview\n",
    "\n",
    "In this notebook, you will learn how to load, explore, visualize, and pre-process a time-series dataset. The output of this notebook is a processed dataset that will be used in following notebooks to build a machine learning model.\n",
    "\n",
    "### Dataset\n",
    "\n",
    "[CTA - Ridership - Daily Boarding Totals](https://data.cityofchicago.org/Transportation/CTA-Ridership-Daily-Boarding-Totals/6iiy-9s97): This dataset shows systemwide boardings for both bus and rail services provided by Chicago Transit Authority, dating back to 2001.\n",
    "\n",
    "### Objective\n",
    "\n",
    "The goal is to forecast future transit ridership in the City of Chicago, based on previous ridership."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "i7EUnXsZhAGF"
   },
   "source": [
    "## Install packages and dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Restarting the kernel may be required to use new packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "wyy5Lbnzg5fi"
   },
   "outputs": [],
   "source": [
    "%pip install -U statsmodels scikit-learn --user"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** To restart the Kernel, navigate to Kernel > Restart Kernel... on the Jupyter menu."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XoEqT2Y4DJmf"
   },
   "source": [
    "### Import libraries and define constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pRUOFELefqf1"
   },
   "outputs": [],
   "source": [
    "from pandas.plotting import register_matplotlib_converters\n",
    "from statsmodels.graphics.tsaplots import plot_acf\n",
    "from statsmodels.tsa.seasonal import seasonal_decompose\n",
    "from statsmodels.tsa.stattools import grangercausalitytests\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oM1iC_MfAts1"
   },
   "outputs": [],
   "source": [
    "# Enter your project and region. Then run the  cell to make sure the\n",
    "# Cloud SDK uses the right project for all the commands in this notebook.\n",
    "\n",
    "PROJECT = 'your-project-name' # REPLACE WITH YOUR PROJECT NAME \n",
    "REGION = 'your-region' # REPLACE WITH YOUR LAB REGION\n",
    "\n",
    "#Don't change the following command - this is to check if you have changed the project name above.\n",
    "assert PROJECT != 'your-project-name', 'Don''t forget to change the project variables!'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target = 'total_rides' # The variable you are predicting\n",
    "target_description = 'Total Rides' # A description of the target variable\n",
    "features = {'day_type': 'Day Type'} # Weekday = W, Saturday = A, Sunday/Holiday = U\n",
    "ts_col = 'service_date' # The name of the column with the date field\n",
    "\n",
    "raw_data_file = 'https://data.cityofchicago.org/api/views/6iiy-9s97/rows.csv?accessType=DOWNLOAD'\n",
    "processed_file = 'cta_ridership.csv' # Which file to save the results to"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import CSV file\n",
    "\n",
    "df = pd.read_csv(raw_data_file, index_col=[ts_col], parse_dates=[ts_col])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model data prior to 2020 \n",
    "\n",
    "df = df[df.index < '2020-01-01']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop duplicates\n",
    "\n",
    "df = df.drop_duplicates()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by date\n",
    "\n",
    "df = df.sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the top 5 rows\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO 1: Analyze the patterns\n",
    "\n",
    "* Is ridership changing much over time?\n",
    "* Is there a difference in ridership between the weekday and weekends?\n",
    "* Is the mix of bus vs rail ridership changing over time?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize plotting\n",
    "\n",
    "register_matplotlib_converters() # Addresses a warning\n",
    "sns.set(rc={'figure.figsize':(16,4)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explore total rides over time\n",
    "\n",
    "sns.lineplot(data=df, x=df.index, y=df[target]).set_title('Total Rides')\n",
    "fig = plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explore rides by day type: Weekday (W), Saturday (A), Sunday/Holiday (U)\n",
    "\n",
    "sns.lineplot(data=df, x=df.index, y=df[target], hue=df['day_type']).set_title('Total Rides by Day Type')\n",
    "fig = plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explore rides by transportation type\n",
    "\n",
    "sns.lineplot(data=df[['bus','rail_boardings']]).set_title('Total Rides by Transportation Type')\n",
    "fig = plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO 2: Review summary statistics\n",
    "\n",
    "* How many records are in the dataset?\n",
    "* What is the average # of riders per day?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[target].describe().apply(lambda x: round(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO 3: Explore seasonality\n",
    "\n",
    "* Is there much difference between months?\n",
    "* Can you extract the trend and seasonal pattern from the data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show the distribution of values for each day of the week in a boxplot:\n",
    "# Min, 25th percentile, median, 75th percentile, max \n",
    "\n",
    "daysofweek = df.index.to_series().dt.dayofweek\n",
    "\n",
    "fig = sns.boxplot(x=daysofweek, y=df[target])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show the distribution of values for each month in a boxplot:\n",
    "\n",
    "months = df.index.to_series().dt.month\n",
    "\n",
    "fig = sns.boxplot(x=months, y=df[target])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decompose the data into trend and seasonal components\n",
    "\n",
    "result = seasonal_decompose(df[target], period=365)\n",
    "fig = result.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Auto-correlation\n",
    "\n",
    "Next, we will create an auto-correlation plot, to show how correlated a time-series is with itself. Each point on the x-axis indicates the correlation at a given lag. The shaded area indicates the confidence interval.\n",
    "\n",
    "Note that the correlation gradually decreases over time, but reflects weekly seasonality (e.g. `t-7` and `t-14` stand out)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_acf(df[target])\n",
    "\n",
    "fig = plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export data\n",
    "\n",
    "This will generate a CSV file, which you will use in the next labs of this quest.\n",
    "Inspect the CSV file to see what the data looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[target]].to_csv(processed_file, index=True, index_label=ts_col)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You've successfully completed the exploration and visualization lab.\n",
    "You've learned how to:\n",
    "* Create a query that groups data into a time series\n",
    "* Visualize data\n",
    "* Decompose time series into trend and seasonal components"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "ai_platform_notebooks_template.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "environment": {
   "kernel": "python3",
   "name": "tf2-gpu.2-6.m82",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m82"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
